Text Mining - Introduction - Sens Recaps

(adapted from T. Kwartler's excellent course Text Mining: Bag of Words on DataCamp.com)

In this notebook, you will be introduced to the basic notions of text mining using the tm and qdap R libraries. For most cells, the description or lead-up is given in the comment lines (shown in green; comments start with the # character). In some instances, explanations are provided separately.

The main dataset you will work with is the text content of Associated Press game recaps involving the Ottawa Senators during the 2016-2017 NHL season.

SECTIONS

  1. Initializing the Environment
  2. Introduction
  3. Sens Recaps Data
  4. VCorpus From Vector With tm
  5. Pre-processing a Document With tm
  6. Pre-processing a Document With qdap
  7. Stopwords
  8. Word Stemming and Stem Completion
  9. Pre-processing a Corpus
  10. Document-Term Matrix
  11. Term-Document Matrix
  12. Barchart of Frequent Terms With tm
  13. Reprise
  14. Exercises


INITIALIZING THE ENVIRONMENT

In [4]:
library('tm')   # R text mining library
library('qdap') # R quantitative discourse analysis package


INTRODUCTION

In [2]:
new_text <- "The Ottawa Senators have the Atlantic Division lead in their sights.  Mark Stone had a goal and four assists, Derick Brassard scored twice in the third period and the Senators recovered after blowing a two-goal lead to beat the Toronto Maple Leafs 6-3 on Saturday night.  The Senators pulled within two points of Montreal for first place in the Atlantic Division with three games in hand.  We like where we're at. We're in a good spot, Stone said. But there's a little bit more that we want. Obviously, there's teams coming and we want to try and create separation, so the only way to do that is keep winning hockey games.  Ottawa led 2-0 after one period but trailed 3-2 in the third before getting a tying goal from Mike Hoffman and a power-play goal from Brassard. Stone and Brassard added empty-netters, and Chris Wideman and Ryan Dzingel also scored for the Senators.  Ottawa has won four of five overall and three of four against the Leafs this season. Craig Anderson stopped 34 shots.  Morgan Rielly, Nazem Kadri and William Nylander scored and Auston Matthews had two assists for the Maple Leafs. Frederik Andersen allowed four goals on 40 shots.  Toronto has lost eight of 11 and entered the night with a tenuous grip on the final wild-card spot in the Eastern Conference.  The reality is we're all big boys, we can read the standings. You've got to win hockey games, Babcock said. After Nylander made it 3-2 with a power-play goal 2:04 into the third, Hoffman tied it by rifling a shot from the right faceoff circle off the post and in. On a power play 54 seconds later, Andersen stopped Erik Karlsson's point shot, but Brassard jumped on the rebound and put it in for a 4-3 lead.  Wideman started the scoring in the first, firing a point shot through traffic moments after Stone beat Nikita Zaitsev for a puck behind the Leafs goal. Dzingel added to the lead when he deflected Marc Methot's point shot 20 seconds later.  
Andersen stopped three shots during a lengthy 5-on-3 during the second period, and the Leafs got on the board about three minutes later. Rielly scored with 5:22 left in the second by chasing down a wide shot from Matthews, carrying it to the point and shooting through a crowd in front.  About three minutes later, Zaitsev fired a shot from the right point that sneaked through Anderson's pads and slid behind the net. Kadri chased it down and banked it off Dzingel's helmet and in for his 24th goal of the season. Dzingel had fallen in the crease trying to prevent Kadri from stuffing the rebound in.  Our game plan didn't change for the third period, and that's just the maturity we're gaining over time, Senators coach Guy Boucher said. Our leaders have been doing a great job, but collectively, the team has grown dramatically in terms of having poise, executing under pressure.    Game notes : Mitch Marner sat out for Toronto with an upper-body injury. Marner leads Toronto with 48 points and is also expected to sit Sunday night against Carolina."

# Print new_text to the console
new_text

# Find the 20 most frequent terms: term_count
term_count <- freq_terms(new_text,20)

# Plot term_count
plot(term_count)
Out[2]:
'The Ottawa Senators have the Atlantic Division lead in their sights. Mark Stone had a goal and four assists, Derick Brassard scored twice in the third period and the Senators recovered after blowing a two-goal lead to beat the Toronto Maple Leafs 6-3 on Saturday night. The Senators pulled within two points of Montreal for first place in the Atlantic Division with three games in hand. We like where we\'re at. We\'re in a good spot, Stone said. But there\'s a little bit more that we want. Obviously, there\'s teams coming and we want to try and create separation, so the only way to do that is keep winning hockey games. Ottawa led 2-0 after one period but trailed 3-2 in the third before getting a tying goal from Mike Hoffman and a power-play goal from Brassard. Stone and Brassard added empty-netters, and Chris Wideman and Ryan Dzingel also scored for the Senators. Ottawa has won four of five overall and three of four against the Leafs this season. Craig Anderson stopped 34 shots. Morgan Rielly, Nazem Kadri and William Nylander scored and Auston Matthews had two assists for the Maple Leafs. Frederik Andersen allowed four goals on 40 shots. Toronto has lost eight of 11 and entered the night with a tenuous grip on the final wild-card spot in the Eastern Conference. The reality is we\'re all big boys, we can read the standings. You\'ve got to win hockey games, Babcock said. After Nylander made it 3-2 with a power-play goal 2:04 into the third, Hoffman tied it by rifling a shot from the right faceoff circle off the post and in. On a power play 54 seconds later, Andersen stopped Erik Karlsson\'s point shot, but Brassard jumped on the rebound and put it in for a 4-3 lead. Wideman started the scoring in the first, firing a point shot through traffic moments after Stone beat Nikita Zaitsev for a puck behind the Leafs goal. Dzingel added to the lead when he deflected Marc Methot\'s point shot 20 seconds later. 
Andersen stopped three shots during a lengthy 5-on-3 during the second period, and the Leafs got on the board about three minutes later. Rielly scored with 5:22 left in the second by chasing down a wide shot from Matthews, carrying it to the point and shooting through a crowd in front. About three minutes later, Zaitsev fired a shot from the right point that sneaked through Anderson\'s pads and slid behind the net. Kadri chased it down and banked it off Dzingel\'s helmet and in for his 24th goal of the season. Dzingel had fallen in the crease trying to prevent Kadri from stuffing the rebound in. Our game plan didn\'t change for the third period, and that\'s just the maturity we\'re gaining over time, Senators coach Guy Boucher said. Our leaders have been doing a great job, but collectively, the team has grown dramatically in terms of having poise, executing under pressure. Game notes : Mitch Marner sat out for Toronto with an upper-body injury. Marner leads Toronto with 48 points and is also expected to sit Sunday night against Carolina.'
Out[2]:
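
freq_terms() comes from qdap. To give a rough idea of what a term-frequency count involves, here is a minimal base R sketch. It is not qdap's implementation, and the snippet text is a made-up stand-in for new_text.

```r
# Hypothetical short text standing in for new_text
snippet <- "The Senators beat the Leafs. The Leafs trailed the Senators."

# Lowercase, drop punctuation, then split on whitespace
words <- strsplit(tolower(gsub("[[:punct:]]", "", snippet)), "\\s+")[[1]]

# Count each distinct word and sort from most to least frequent
counts <- sort(table(words), decreasing = TRUE)
head(counts, 3)
```

Note that the most frequent word is a stop word ("the"), which is one reason stop word removal is covered later in the notebook.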


SENS RECAPS DATA

In [3]:
# Import text data
recaps <- read.csv(file="Data/Recap_data.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)

# View the structure of recaps
str(recaps)

# Print out the number of rows in recaps
nrow(recaps)
'data.frame':	101 obs. of  34 variables:
 $ GP          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ X0_Type     : chr  "1_Regular" "1_Regular" "1_Regular" "1_Regular" ...
 $ Date        : chr  "10/12/2016" "10/15/2016" "10/17/2016" "10/18/2016" ...
 $ Time        : chr  "7:00 PM" "7:00 PM" "7:30 PM" "7:30 PM" ...
 $ X           : chr  "" "" "A" "" ...
 $ Opponent    : chr  "Toronto Maple Leafs" "Montreal Canadiens" "Detroit Red Wings" "Arizona Coyotes" ...
 $ GF          : int  5 4 1 7 1 3 2 2 2 1 ...
 $ GA          : int  4 3 5 4 4 0 5 0 1 0 ...
 $ Result      : chr  "W" "W" "L" "W" ...
 $ OT_SO       : chr  "OT" "SO" "" "" ...
 $ W_Record    : int  1 2 2 3 3 4 4 5 6 7 ...
 $ L_Record    : int  0 0 1 1 2 2 3 3 3 3 ...
 $ OL_Record   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Streak      : chr  "W 1" "W 2" "L 1" "W 1" ...
 $ OTT_S       : int  30 38 32 42 28 28 33 22 32 24 ...
 $ OTT_PIM     : int  13 10 22 14 8 2 4 11 13 20 ...
 $ OTT_PPG     : int  0 0 0 1 0 0 2 0 0 0 ...
 $ OTT_PPO     : int  2 4 3 2 3 1 4 2 2 4 ...
 $ OTT_SHO     : int  0 0 1 1 0 0 0 0 0 0 ...
 $ OPP_S       : int  38 24 25 35 35 22 19 37 33 27 ...
 $ OPP_PIM     : int  11 10 20 6 6 2 8 9 11 22 ...
 $ OPP_PPG     : int  0 1 2 1 2 0 0 0 0 0 ...
 $ OPP_PPO     : int  4 4 4 5 4 1 2 4 3 2 ...
 $ OPP_SHG     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ATT         : chr  "17,618" "18,195" "20,027" "11,061" ...
 $ LOG         : chr  "2:36" "2:44" "2:33" "2:43" ...
 $ AP_Headline : chr  "Maple Leafs\xcd Matthews has modern record 4 goals in NHL debut" "Karlsson gets shootout winner as Senators edge Canadiens 4-3" "Green's 1st hat trick helps Red Wings beat Senators 5-1" "Pyatt, Stone, Kelly lead Ottawa over Arizona 7-4" ...
 $ AP_Recap    : chr  "Auston Matthews needed 40 minutes to get into the NHL record book. In the highest-scoring debut in modern NHL history, Matthews "Guy Boucher trusted his instincts when selecting skaters for the shootout and it paid off for the Ottawa Senators. The Senators "Mike Green scored three times for his first hat trick and Darren Helm had two goals to help the Detroit Red Wings beat the Otta "Just four games into the season and the Ottawa Senators are already looking to improve on their consistency. Five Senators had  ...
 $ SSS_Author  : chr  "Ross A" "Ary M" "Michaela Schreiter" "Ary M" ...
 $ SSS_Headline: chr  "Auston Matthews Loses 5-4 to Sens in OT" "Karlsson, Sens down Habs 4-3 in the shootout" "Red Wings hand Senators first loss of the season" "Hoffman, Pyatt help Sens round up \xd4Yotes, 7 - 4" ...
 $ SSS_Recap   : chr  "The NHL.com headline for the game was \xd2Auston Matthews scores four goals, Maple Leafslose\xd3, and really that sums up the e "It wasn\xd5t easy \xd1 it never is \xd1 but the Sens showed the progress that Coach Boucher promised in their second game of th "We all knew it wouldn't be this easy, didn't we? After going 2-0-0 in their first two games, the Ottawa Senators landed in Detr "Remember when games against the Coyotes used to be boring, or ended up in Mikkel Boedker hat-tricks? This one was anything but, ...
 $ OPP_Blog    : chr  "Pension Plan Puppet" "Eyes on the Prize" "Winging It in Motown" "Five For Howling" ...
 $ OPP_Title   : chr  "Sens 5, Auston Matthews 4 (OT)" "Petry\xd5s heroics, Lehkonen\xd5s first goal not enough to top Sens" "An Ever-Green Night At The Joe: Red Wings 5, Senators 1" "Mike Smith injured in Arizona Coyotes\xd5 7-4 loss in Ottawa" ...
 $ OPP_Recap   : chr  "The first period started exactly the way that Leafs' fans wanted it to. Each line looked strong, and each of the kids looked fa "The Montreal Canadiens headed into the Canadian Tire Centre to take on the Ottawa Senators, without the services of their start "Tonight the Red Wings hosted their 36th and final home opener in Joe Louis Arena against long time rival..... .....the Senators "Dylan Strome was obviously disappointed to not play in the Arizona Coyotes\xd5 season opener. But when his number was called to ...
Out[3]:
101
In [5]:
# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP_Recap

# Show first 6 recaps
head(AP.recaps)
WARNING: Some output was deleted.

The game recaps contain odd characters (such as \xd1), which point to a problem with the text encoding. Let's revisit the last few steps with a slightly different data file.

In [6]:
# Import text data
recaps <- read.csv(file="Data/Recap_data_first_pass.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)

# View the structure of recaps
str(recaps)

# Print out the number of rows in recaps
nrow(recaps)

# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP.Recap
'data.frame':	101 obs. of  34 variables:
 $ GP          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ X0.Type     : chr  "1 Regular" "1 Regular" "1 Regular" "1 Regular" ...
 $ Date        : chr  "10/12/2016" "10/15/2016" "10/17/2016" "10/18/2016" ...
 $ Time        : chr  "7:00 PM" "7:00 PM" "7:30 PM" "7:30 PM" ...
 $ X           : chr  "" "" "A" "" ...
 $ Opponent    : chr  "Toronto Maple Leafs" "Montreal Canadiens" "Detroit Red Wings" "Arizona Coyotes" ...
 $ GF          : int  5 4 1 7 1 3 2 2 2 1 ...
 $ GA          : int  4 3 5 4 4 0 5 0 1 0 ...
 $ Result      : chr  "W" "W" "L" "W" ...
 $ OT.SO       : chr  "OT" "SO" "" "" ...
 $ W.Record    : int  1 2 2 3 3 4 4 5 6 7 ...
 $ L.Record    : int  0 0 1 1 2 2 3 3 3 3 ...
 $ OL.Record   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Streak      : chr  "W 1" "W 2" "L 1" "W 1" ...
 $ OTT.S       : int  30 38 32 42 28 28 33 22 32 24 ...
 $ OTT.PIM     : int  13 10 22 14 8 2 4 11 13 20 ...
 $ OTT.PPG     : int  0 0 0 1 0 0 2 0 0 0 ...
 $ OTT.PPO     : int  2 4 3 2 3 1 4 2 2 4 ...
 $ OTT.SHG     : int  0 0 1 1 0 0 0 0 0 0 ...
 $ OPP.S       : int  38 24 25 35 35 22 19 37 33 27 ...
 $ OPP.PIM     : int  11 10 20 6 6 2 8 9 11 22 ...
 $ OPP.PPG     : int  0 1 2 1 2 0 0 0 0 0 ...
 $ OPP.PPO     : int  4 4 4 5 4 1 2 4 3 2 ...
 $ OPP.SHG     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ATT         : chr  "17,618" "18,195" "20,027" "11,061" ...
 $ LOG         : chr  "2:36" "2:44" "2:33" "2:43" ...
 $ AP.Headline : chr  "Maple Leafs' Matthews has modern record 4 goals in NHL debut" "Karlsson gets shootout winner as Senators edge Canadiens 4-3" "Green's 1st hat trick helps Red Wings beat Senators 5-1" "Pyatt, Stone, Kelly lead Ottawa over Arizona 7-4" ...
 $ AP.Recap    : chr  "Auston Matthews needed 40 minutes to get into the NHL record book. In the highest-scoring debut in modern NHL h"| __truncated__ "Guy Boucher trusted his instincts when selecting skaters for the shootout and it paid off for the Ottawa Senato"| __truncated__ "Mike Green scored three times for his first hat trick and Darren Helm had two goals to help the Detroit Red Win"| __truncated__ "Just four games into the season and the Ottawa Senators are already looking to improve on their consistency. Fi"| __truncated__ ...
 $ SSS.Author  : chr  "Ross A" "Ary M" "Michaela Schreiter" "Ary M" ...
 $ SSS.Headline: chr  "Auston Matthews Loses 5-4 to Sens in OT" "Karlsson, Sens down Habs 4-3 in the shootout" "Red Wings hand Senators first loss of the season" "Hoffman, Pyatt help Sens round up \x82Yotes, 7 - 4" ...
 $ SSS.Recap   : chr  "The NHL.com headline for the game was _Auston Matthews scores four goals, Maple Leafslose\xee, and really that sums up the enti "It wasn\x90t easy \xca it never is \xca but the Sens showed the progress that Coach Boucher promised in their second game of th "We all knew it wouldn't be this easy, didn't we? After going 2-0-0 in their first two games, the Ottawa Senators landed in Detr "Remember when games against the Coyotes used to be boring, or ended up in Mikkel Boedker hat-tricks? This one was anything but, ...
 $ OPP.Blog    : chr  "Pension Plan Puppet" "Eyes on the Prize" "Winging It in Motown" "Five For Howling" ...
 $ OPP.Title   : chr  "Sens 5, Auston Matthews 4 (OT)" "Petry\x90s heroics, Lehkonen\x90s first goal not enough to top Sens" "An Ever-Green Night At The Joe: Red Wings 5, Senators 1" "Mike Smith injured in Arizona Coyotes\x90 7-4 loss in Ottawa" ...
 $ OPP.Recap   : chr  "The first period started exactly the way that Leafs' fans wanted it to. Each line looked strong, and each of the kids looked fa "The Montreal Canadiens headed into the Canadian Tire Centre to take on the Ottawa Senators, without the services of their start "Tonight the Red Wings hosted their 36th and final home opener in Joe Louis Arena against long time rival..... .....the Senators "Dylan Strome was obviously disappointed to not play in the Arizona Coyotes\x90 season opener. But when his number was called to ...
Out[6]:
101
In [8]:
# Show first 6 recaps
head(AP.recaps)
WARNING: Some output was deleted.

For reasons that are perhaps too technical to get into at this point, the encoding of Recap_data_first_pass.csv still causes problems with tm and qdap down the road (note the remaining \x90 and \xca sequences in the output above). These problems disappear when the file is saved with UTF-8 encoding.

In [9]:
# Import text data
recaps <- read.csv(file="Data/Recap_data_first_pass_utf8.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)

# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP.Recap
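
The notebook does not say how the UTF-8 file was produced. As a minimal sketch of the general idea, base R can detect and convert badly encoded strings with validUTF8() and iconv(). The string below and its Latin-1 source encoding are assumptions chosen for illustration; the actual encoding of the recap files is not stated.

```r
# Hypothetical byte string: "caf\xe9" is "café" stored as Latin-1 bytes
raw_text <- "caf\xe9"

# The raw bytes are not valid UTF-8
validUTF8(raw_text)

# Convert from the assumed source encoding to UTF-8
fixed <- iconv(raw_text, from = "latin1", to = "UTF-8")
validUTF8(fixed)
```

read.csv() also accepts a fileEncoding argument, so a conversion of this kind can be requested at import time when the source encoding is known.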


VCORPUS FROM VECTOR WITH tm

In [10]:
# Make a vector source: AP.recaps.source
AP.recaps.source <- VectorSource(AP.recaps)

# Make a volatile corpus: AP.recaps.corpus
AP.recaps.corpus <- VCorpus(AP.recaps.source)

# Print out AP.recaps.corpus
AP.recaps.corpus

# Print data on the 15th recap in AP.recaps.corpus
AP.recaps.corpus[[15]]

# Print the content of the 15th recap in AP.recaps.corpus
AP.recaps.corpus[[15]][1]

# Print the meta of the 15th recap in AP.recaps.corpus
AP.recaps.corpus[[15]][2]
Out[10]:
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 101
Out[10]:
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 2871
Out[10]:
$content = 'For a team playing its third game in four nights, the Minnesota Wild looked plenty fresh on Sunday night -- even in overtime. Matt Dumba scored late in the extra session and Darcy Kuemper stopped 35 shots, helping Minnesota beat the Ottawa Senators 2-1. The Wild were coming off a 3-2 loss to Philadelphia on Saturday after beating Pittsburgh on Thursday. It\'s not the end of the world to play back-to-backs, and I thought we held on and did a good job, Wild coach Bruce Boudreau said. Ryan Suter scored a short-handed goal in the first period and Kuemper helped the Wild kill off three early power plays. We\'re not a team that wants to get behind 3-0, so getting those three kills and getting out of the first period with a lead was really important for us I\'d say, Boudreau said. Craig Anderson made 40 saves and was again solid for the Senators, who got a goal from Kyle Turris 5:06 into the third period. Ottawa has 11 goals over its last eight games and are 1 for 24 on the power play. I love the way we\'re playing, we\'re giving ourselves a chance to win by being there at the end, Anderson said. The positive out of not scoring right now is that even though guys might be showing some frustration, they\'re still doing their jobs on the defensive side of the puck which is allowing us to get points and give ourselves an opportunity to be in each game. Despite their hectic schedule of late, the Wild controlled much of the action with Ottawa looking disorganized for most of the night. We had all kinds of scoring chances, but we just can\'t find the back of the net, Senators coach Guy Boucher said. It\'s a matter of creating the same chances and then having some go in and you\'re able to relax and not grip the stick so tight. Turris finally got Ottawa on the board when he beat Kuemper far stick-side with a wrist shot. boucher \'s decision to dress seven defenseman worked out when Marc Methot left after the first period with a lower-body injury and did not return. 
Boucher said after the game that Ottawa knew Methot was dealing with an issue, which is why the Senators used an extra defender. Suter scored late in the first and Ottawa continued to struggle in the second, leaving Anderson to keep the team in the game. He made huge saves on Nino Niederreiter and Erik Staal to keep it 1-0. The Senators had four power-play opportunities and struggled to create offense, a common refrain for the team as of late. The Wild made the Senators\' power play look even worse when they scored short-handed. Staal got off a shot and Suter was there for the rebound. NOTES: LW Matt Puempel was a late healthy scratch for the Senators. Minnesota LW Zach Parise (lower body) missed his sixth straight game. C Joel Eriksson and D Nate Prosser were a healthy scratch. UP NEXT Wild: Host Calgary on Tuesday night. Senators: Play at Philadelphia on Tuesday night. '
Out[10]:
$meta
  author       : character(0)
  datetimestamp: 2019-03-12 15:07:18
  description  : character(0)
  heading      : character(0)
  id           : 15
  language     : en
  origin       : character(0)
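
The positional indexing above ([[15]][1] and [[15]][2]) works, but tm also provides the accessor functions content() and meta(), which say more clearly what is being extracted. A minimal sketch on a made-up two-document corpus (the documents are hypothetical):

```r
library(tm)

# Hypothetical two-document corpus standing in for AP.recaps.corpus
docs <- c("Ottawa won in overtime.", "Toronto lost at home.")
corp <- VCorpus(VectorSource(docs))

# Text of the second document
content(corp[[2]])

# A single metadata field of the second document
meta(corp[[2]], "id")
```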

We can also take a look at some basic statistics regarding the number of characters and the number of words in the game recaps.

In [11]:
# Statistics on the number of characters per recap
length_of_recaps_char <- vector(mode="numeric", length=nrow(recaps))
for(j in 1:nrow(recaps)){length_of_recaps_char[j]=nchar(AP.recaps.corpus[[j]][1])}

hist(length_of_recaps_char, freq=F, main="Distribution of # of characters in Senators game recaps (16-17)")
summary(length_of_recaps_char)

# Statistics on the number of words per recap
length_of_recaps_word <- vector(mode="numeric", length=nrow(recaps))
for(j in 1:nrow(recaps)){length_of_recaps_word[j]=length(strsplit(gsub(' {2,}',' ',AP.recaps.corpus[[j]][1]),' ')[[1]])}

hist(length_of_recaps_word, freq=F, main="Distribution of # of words in Senators game recaps (16-17)")
summary(length_of_recaps_word)
Out[11]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2027    3223    3689    3683    4227    5087 
Out[11]:
Out[11]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    375     565     665     664     774     921 
Out[11]:
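
The loops above fill the length vectors one recap at a time. Since nchar() and strsplit() are vectorized, the same kind of count can be sketched without a loop. The two strings below are made-up stand-ins for the recap texts, and splitting on any run of whitespace is an assumption that differs slightly from the gsub() approach used above.

```r
# Hypothetical stand-ins for two recap texts
recaps_demo <- c("Stone scored  twice.",
                 "Anderson stopped 34 shots in the win.")

# Characters per text
n_chars <- nchar(recaps_demo)

# Words per text: split on runs of whitespace, then count the pieces
n_words <- lengths(strsplit(trimws(recaps_demo), "\\s+"))

summary(n_chars)
summary(n_words)
```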


PRE-PROCESSING A DOCUMENT WITH tm

In [12]:
# Create the object: text
text <- "<i>He</i> went to bed at       2 A.M. It\'s way too late!  He was only 20% asleep at first, but sleep eventually came."
text
Out[12]:
'<i>He</i> went to bed at 2 A.M. It\'s way too late! He was only 20% asleep at first, but sleep eventually came.'
In [13]:
# All lowercase
tolower(text)

# Remove punctuation
removePunctuation(text)

# Remove numbers
removeNumbers(text)

# Collapse runs of whitespace into a single space
stripWhitespace(text)
Out[13]:
'<i>he</i> went to bed at 2 a.m. it\'s way too late! he was only 20% asleep at first, but sleep eventually came.'
Out[13]:
'iHei went to bed at 2 AM Its way too late He was only 20 asleep at first but sleep eventually came'
Out[13]:
'<i>He</i> went to bed at A.M. It\'s way too late! He was only % asleep at first, but sleep eventually came.'
Out[13]:
'<i>He</i> went to bed at 2 A.M. It\'s way too late! He was only 20% asleep at first, but sleep eventually came.'
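
In the outputs above, removePunctuation() turns the markup <i>He</i> into iHei, because only the angle brackets and the slash are punctuation. One possible remedy, sketched here as an assumption rather than as the course's method, is to strip the tags with a regular expression before removing punctuation:

```r
library(tm)

# Shortened version of the example sentence
text_demo <- "<i>He</i> went to bed at       2 A.M."

# Drop anything that looks like an HTML tag first
no_tags <- gsub("<[^>]+>", "", text_demo)

# Then remove punctuation and collapse the extra spaces
cleaned <- stripWhitespace(removePunctuation(no_tags))
cleaned
```

The order of the steps matters: run in the opposite order, the tag letters would already be fused to the word.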


PRE-PROCESSING A DOCUMENT WITH qdap

In [14]:
# Remove text within brackets
bracketX(text)

# Replace numbers with words
replace_number(text)

# Replace abbreviations
replace_abbreviation(text)

# Replace contractions
replace_contraction(text)

# Replace symbols with words
replace_symbol(text)
Out[14]:
'He went to bed at 2 A.M. It\'s way too late! He was only 20% asleep at first, but sleep eventually came.'
Out[14]:
'<i>He</i> went to bed at two A.M. It\'s way too late! He was only twenty% asleep at first, but sleep eventually came.'
Out[14]:
'<i>He</i> went to bed at 2 AM It\'s way too late! He was only 20% asleep at first, but sleep eventually came.'
Out[14]:
'<i>He</i> went to bed at 2 A.M. it is way too late! He was only 20% asleep at first, but sleep eventually came.'
Out[14]:
'<i>He</i> went to bed at 2 A.M. It\'s way too late! He was only 20 percent asleep at first, but sleep eventually came.'


STOPWORDS

In [15]:
# List standard English stop words
stopwords("en")

# Print text without standard stop words
removeWords(text,stopwords("en"))

# Add "sleep" and "asleep" to the list: new_stops
new_stops <- c("sleep","asleep",stopwords("en"))

# Remove stop words from text
removeWords(text,new_stops)
Out[15]:
  1. 'i'
  2. 'me'
  3. 'my'
  4. 'myself'
  5. 'we'
  6. 'our'
  7. 'ours'
  8. 'ourselves'
  9. 'you'
  10. 'your'
  11. 'yours'
  12. 'yourself'
  13. 'yourselves'
  14. 'he'
  15. 'him'
  16. 'his'
  17. 'himself'
  18. 'she'
  19. 'her'
  20. 'hers'
  21. 'herself'
  22. 'it'
  23. 'its'
  24. 'itself'
  25. 'they'
  26. 'them'
  27. 'their'
  28. 'theirs'
  29. 'themselves'
  30. 'what'
  31. 'which'
  32. 'who'
  33. 'whom'
  34. 'this'
  35. 'that'
  36. 'these'
  37. 'those'
  38. 'am'
  39. 'is'
  40. 'are'
  41. 'was'
  42. 'were'
  43. 'be'
  44. 'been'
  45. 'being'
  46. 'have'
  47. 'has'
  48. 'had'
  49. 'having'
  50. 'do'
  51. 'does'
  52. 'did'
  53. 'doing'
  54. 'would'
  55. 'should'
  56. 'could'
  57. 'ought'
  58. 'i\'m'
  59. 'you\'re'
  60. 'he\'s'
  61. 'she\'s'
  62. 'it\'s'
  63. 'we\'re'
  64. 'they\'re'
  65. 'i\'ve'
  66. 'you\'ve'
  67. 'we\'ve'
  68. 'they\'ve'
  69. 'i\'d'
  70. 'you\'d'
  71. 'he\'d'
  72. 'she\'d'
  73. 'we\'d'
  74. 'they\'d'
  75. 'i\'ll'
  76. 'you\'ll'
  77. 'he\'ll'
  78. 'she\'ll'
  79. 'we\'ll'
  80. 'they\'ll'
  81. 'isn\'t'
  82. 'aren\'t'
  83. 'wasn\'t'
  84. 'weren\'t'
  85. 'hasn\'t'
  86. 'haven\'t'
  87. 'hadn\'t'
  88. 'doesn\'t'
  89. 'don\'t'
  90. 'didn\'t'
  91. 'won\'t'
  92. 'wouldn\'t'
  93. 'shan\'t'
  94. 'shouldn\'t'
  95. 'can\'t'
  96. 'cannot'
  97. 'couldn\'t'
  98. 'mustn\'t'
  99. 'let\'s'
  100. 'that\'s'
  101. 'who\'s'
  102. 'what\'s'
  103. 'here\'s'
  104. 'there\'s'
  105. 'when\'s'
  106. 'where\'s'
  107. 'why\'s'
  108. 'how\'s'
  109. 'a'
  110. 'an'
  111. 'the'
  112. 'and'
  113. 'but'
  114. 'if'
  115. 'or'
  116. 'because'
  117. 'as'
  118. 'until'
  119. 'while'
  120. 'of'
  121. 'at'
  122. 'by'
  123. 'for'
  124. 'with'
  125. 'about'
  126. 'against'
  127. 'between'
  128. 'into'
  129. 'through'
  130. 'during'
  131. 'before'
  132. 'after'
  133. 'above'
  134. 'below'
  135. 'to'
  136. 'from'
  137. 'up'
  138. 'down'
  139. 'in'
  140. 'out'
  141. 'on'
  142. 'off'
  143. 'over'
  144. 'under'
  145. 'again'
  146. 'further'
  147. 'then'
  148. 'once'
  149. 'here'
  150. 'there'
  151. 'when'
  152. 'where'
  153. 'why'
  154. 'how'
  155. 'all'
  156. 'any'
  157. 'both'
  158. 'each'
  159. 'few'
  160. 'more'
  161. 'most'
  162. 'other'
  163. 'some'
  164. 'such'
  165. 'no'
  166. 'nor'
  167. 'not'
  168. 'only'
  169. 'own'
  170. 'same'
  171. 'so'
  172. 'than'
  173. 'too'
  174. 'very'
Out[15]:
'<>He</> went bed 2 A.M. It\'s way late! He 20% asleep first, sleep eventually came.'
Out[15]:
'<>He</> went bed 2 A.M. It\'s way late! He 20% first, eventually came.'
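
Notice that "He" survives in both outputs even though 'he' is on the stop word list: the list is all lowercase and removeWords() matches case-sensitively. A minimal sketch of the effect, using a shortened version of the sentence:

```r
library(tm)

sentence <- "He was only 20% asleep at first"

# Capitalized "He" is not matched by the lowercase stop word "he"
kept_case <- removeWords(sentence, stopwords("en"))

# Lowercasing first lets the stop word list catch it
lowered <- removeWords(tolower(sentence), stopwords("en"))

kept_case
lowered
```

This is one reason to lowercase text before removing stop words rather than after.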

Now combine some pre-processing steps into one call:

In [16]:
tolower(
    stripWhitespace(
        removeWords(
            removePunctuation(
                replace_symbol(
                    replace_contraction(
                        replace_abbreviation(
                            bracketX(text)
                        )
                    )
                )
            )
        ,stopwords("en"))
    )
)
Out[16]:
'he went bed 2 am way late he 20 percent asleep first sleep eventually came'
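
Deeply nested calls are read inside-out, which makes the order of the steps hard to follow. As an alternative sketch, the native pipe |> (an assumption: it requires R 4.1 or later) lists the steps in the order they run. This version uses only tm and base R functions, and it lowercases first so that capitalized stop words are also removed; it is a variation, not a reproduction of the call above.

```r
library(tm)

# Shortened version of the example sentence
text_demo <- "He went to bed at       2 A.M. It's way too late!"

cleaned <- text_demo |>
  tolower() |>                        # lowercase so stop words match
  removePunctuation() |>              # drop periods, apostrophes, etc.
  removeWords(stopwords("en")) |>     # remove standard English stop words
  stripWhitespace() |>                # collapse the gaps left behind
  trimws()                            # drop leading/trailing spaces

cleaned
```

With lowercasing first, "am" is itself treated as a stop word, which shows how the order of steps can change what remains.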


WORD STEMMING AND STEM COMPLETION

In [17]:
# Create sleep
(sleep <- c("sleepful","sleeps","sleeping"))

# Perform word stemming: stem_doc
(stem_doc <- stemDocument(sleep))

# Create the completion dictionary: sleep_dict
sleep_dict <- c("sleep")

# Perform stem completion: complete_text 
complete_text <- stemCompletion(stem_doc,sleep_dict)

# Print complete_text
complete_text
Out[17]:
  1. 'sleepful'
  2. 'sleeps'
  3. 'sleeping'
Out[17]:
  1. 'sleep'
  2. 'sleep'
  3. 'sleep'
Out[17]:
  sleep   sleep   sleep
'sleep' 'sleep' 'sleep'
In [18]:
(text_data <- "In sleepful nights, Katia sleeps to achieve sleeping.")
(comp_dict <- c("In","sleep","nights","Katia","to","achieve"))

# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)

# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = ' '))

# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)

# Print stem_doc
stem_doc

# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc,comp_dict)

# Print complete_doc
complete_doc
Out[18]:
'In sleepful nights, Katia sleeps to achieve sleeping.'
Out[18]:
  1. 'In'
  2. 'sleep'
  3. 'nights'
  4. 'Katia'
  5. 'to'
  6. 'achieve'
Out[18]:
  1. 'In'
  2. 'sleep'
  3. 'night'
  4. 'Katia'
  5. 'sleep'
  6. 'to'
  7. 'achiev'
  8. 'sleep'
Out[18]:
  In   sleep    night   Katia   sleep   to    achiev   sleep
'In' 'sleep' 'nights' 'Katia' 'sleep' 'to' 'achieve' 'sleep'
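
In the result above, each stem is mapped to a dictionary word that begins with it (night to nights, achiev to achieve). The helper below is a simplified base R illustration of that prefix-matching idea. It is a hypothetical function written for this example and is not how tm's stemCompletion() is implemented.

```r
# Hypothetical helper: map each stem to the first dictionary word starting with it
complete_stems <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1))
}

completed <- complete_stems(c("achiev", "night", "sleep"),
                            c("achieve", "nights", "sleep"))
completed
```

A stem with no matching dictionary word comes back as NA in this sketch, which is a reminder that completion is only as good as the dictionary supplied.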


PRE-PROCESSING A CORPUS

In [19]:
# Create a function to clean the corpus, mixing tm and qdap functions
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"))) 
  return(corpus)
}

# Apply your customized function to the AP.recaps.corpus: clean_corp.AP.recaps
clean_corp.AP.recaps <- clean_corpus(AP.recaps.corpus)

# Print out a cleaned up recap
clean_corp.AP.recaps[[15]][1]

# Print out the same recap in original form
recaps$AP.Recap[15]
Out[19]:
$content = ' team play third game four night minnesota wild look plenti fresh sunday night even overtim matt dumba score late extra session darci kuemper stop shot help minnesota beat ottawa senat wild come loss philadelphia saturday beat pittsburgh thursday end world play backtoback thought held good job wild coach bruce boudreau said ryan suter score shorthand goal first period kuemper help wild kill three earli power play team want get behind get three kill get first period lead realli import us id say boudreau said craig anderson made save solid senat got goal kyle turri third period ottawa goal last eight game power play love way play give ourselv chanc win end anderson said posit score right now even though guy might show frustrat theyr still job defens side puck allow us get point give ourselv opportun game despit hectic schedul late wild control much action ottawa look disorgan night kind score chanc just cant find back net senat coach guy boucher said matter creat chanc go abl relax grip stick tight turri final got ottawa board beat kuemper far sticksid wrist shot boucher s decis dress seven defenseman work marc methot left first period lowerbodi injuri return boucher said game ottawa knew methot deal issu whi senat use extra defend suter score late first ottawa continu struggl second leav anderson keep team game made huge save nino niederreit erik staal keep senat four powerplay opportun struggl creat offens common refrain team late wild made senat power play look even wors score shorthand staal got shot suter rebound notes lw matt puempel late healthi scratch senat minnesota lw zach paris lower bodi miss sixth straight game c joel eriksson d nate prosser healthi scratch next wild host calgari tuesday night senat play philadelphia tuesday night'
Out[19]:
'For a team playing its third game in four nights, the Minnesota Wild looked plenty fresh on Sunday night -- even in overtime. Matt Dumba scored late in the extra session and Darcy Kuemper stopped 35 shots, helping Minnesota beat the Ottawa Senators 2-1. The Wild were coming off a 3-2 loss to Philadelphia on Saturday after beating Pittsburgh on Thursday. It\'s not the end of the world to play back-to-backs, and I thought we held on and did a good job, Wild coach Bruce Boudreau said. Ryan Suter scored a short-handed goal in the first period and Kuemper helped the Wild kill off three early power plays. We\'re not a team that wants to get behind 3-0, so getting those three kills and getting out of the first period with a lead was really important for us I\'d say, Boudreau said. Craig Anderson made 40 saves and was again solid for the Senators, who got a goal from Kyle Turris 5:06 into the third period. Ottawa has 11 goals over its last eight games and are 1 for 24 on the power play. I love the way we\'re playing, we\'re giving ourselves a chance to win by being there at the end, Anderson said. The positive out of not scoring right now is that even though guys might be showing some frustration, they\'re still doing their jobs on the defensive side of the puck which is allowing us to get points and give ourselves an opportunity to be in each game. Despite their hectic schedule of late, the Wild controlled much of the action with Ottawa looking disorganized for most of the night. We had all kinds of scoring chances, but we just can\'t find the back of the net, Senators coach Guy Boucher said. It\'s a matter of creating the same chances and then having some go in and you\'re able to relax and not grip the stick so tight. Turris finally got Ottawa on the board when he beat Kuemper far stick-side with a wrist shot. boucher \'s decision to dress seven defenseman worked out when Marc Methot left after the first period with a lower-body injury and did not return. 
Boucher said after the game that Ottawa knew Methot was dealing with an issue, which is why the Senators used an extra defender. Suter scored late in the first and Ottawa continued to struggle in the second, leaving Anderson to keep the team in the game. He made huge saves on Nino Niederreiter and Erik Staal to keep it 1-0. The Senators had four power-play opportunities and struggled to create offense, a common refrain for the team as of late. The Wild made the Senators\' power play look even worse when they scored short-handed. Staal got off a shot and Suter was there for the rebound. NOTES: LW Matt Puempel was a late healthy scratch for the Senators. Minnesota LW Zach Parise (lower body) missed his sixth straight game. C Joel Eriksson and D Nate Prosser were a healthy scratch. UP NEXT Wild: Host Calgary on Tuesday night. Senators: Play at Philadelphia on Tuesday night. '

One thing to keep in mind: there is no secret pre-processing formula that will work with all corpora. Context is king/queen.

Let's revisit the first text we looked at:

In [20]:
# Find the 20 most frequent terms: term_count
term_count <- freq_terms(clean_corp.AP.recaps[[56]][1],20)

# Plot term_count
plot(term_count)
Out[20]:

Back to top

DOCUMENT-TERM MATRIX

In [21]:
# Create the dtm from clean_corp.AP.recaps: AP.recaps_dtm
AP.recaps_dtm <- DocumentTermMatrix(clean_corp.AP.recaps)

# Print out AP.recaps_dtm data
AP.recaps_dtm

# Convert AP.recaps_dtm to a matrix: AP.recaps_m
AP.recaps_m <- as.matrix(AP.recaps_dtm)

# Print the dimensions of AP.recaps_m
dim(AP.recaps_m)

# Review a portion of the matrix
AP.recaps_m[79:84, 1005:1010]
Out[21]:
<<DocumentTermMatrix (documents: 101, terms: 3293)>>
Non-/sparse entries: 22187/310406
Sparsity           : 93%
Maximal term length: 15
Weighting          : term frequency (tf)
Out[21]:
  1. 101
  2. 3293
Out[21]:
    Terms
Docs ferland fewer fibula field fifth fifthround
  79       0     0      0     0     1          0
  80       0     0      0     0     0          0
  81       0     0      0     0     0          0
  82       0     0      0     0     0          0
  83       0     0      0     0     0          0
  84       0     0      0     0     0          0
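
The summary printed for AP.recaps_dtm can be checked by hand: the two "Non-/sparse entries" counts add up to the number of cells in the matrix, and sparsity is the share of cells equal to zero. The sketch below illustrates the arithmetic in base R on a small made-up matrix (toy_m and its counts are hypothetical), then repeats it with the figures printed for AP.recaps_dtm.

```r
# Toy document-term matrix (hypothetical counts): rows are documents, columns are terms
toy_m <- matrix(c(2, 0, 0, 1,
                  0, 0, 3, 0,
                  0, 1, 0, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(Docs = c("1", "2", "3"),
                                Terms = c("goal", "save", "shot", "win")))

n_cells   <- prod(dim(toy_m))     # total number of document/term pairs
n_nonzero <- sum(toy_m != 0)      # pairs where the term occurs at least once
n_zero    <- n_cells - n_nonzero  # pairs where the term never occurs
sparsity  <- n_zero / n_cells     # share of empty cells

# Same arithmetic with the counts printed for AP.recaps_dtm:
# 22187 non-zero entries + 310406 zero entries = 101 * 3293 cells
recap_sparsity <- 310406 / (101 * 3293)
round(100 * recap_sparsity)       # 93
```

A sparsity this high is typical of bag-of-words matrices, since most terms occur in only a few documents.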

Back to top

TERM-DOCUMENT MATRIX

In [22]:
# Create a TDM from clean_corp.AP.recaps: AP.recaps_tdm
AP.recaps_tdm <- TermDocumentMatrix(clean_corp.AP.recaps)

# Print AP.recaps_tdm data
AP.recaps_tdm

# Convert AP.recaps_tdm to a matrix: AP.recaps_m
AP.recaps_m <- as.matrix(AP.recaps_tdm)

# Print the dimensions of the matrix
dim(AP.recaps_m)

# Review a portion of the matrix
AP.recaps_m[1005:1010, 79:84]
Out[22]:
<<TermDocumentMatrix (terms: 3293, documents: 101)>>
Non-/sparse entries: 22187/310406
Sparsity           : 93%
Maximal term length: 15
Weighting          : term frequency (tf)
Out[22]:
  1. 3293
  2. 101
Out[22]:
            Docs
Terms        79 80 81 82 83 84
  ferland     0  0  0  0  0  0
  fewer       0  0  0  0  0  0
  fibula      0  0  0  0  0  0
  field       0  0  0  0  0  0
  fifth       1  0  0  0  0  0
  fifthround  0  0  0  0  0  0
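
A term-document matrix carries the same counts as the document-term matrix with rows and columns swapped, which is why the single occurrence of fifth in document 79 shows up in both outputs. A minimal base R sketch on a made-up matrix (toy_dtm is hypothetical) illustrates the relationship:

```r
# Toy document-term matrix (hypothetical counts)
toy_dtm <- matrix(c(2, 0, 1,
                    0, 3, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Docs = c("1", "2"),
                                  Terms = c("goal", "save", "shot")))

# Transposing swaps the roles of documents and terms
toy_tdm <- t(toy_dtm)

dim(toy_dtm)  # documents x terms
dim(toy_tdm)  # terms x documents

# Term totals are column sums of the DTM and row sums of the TDM
term_totals_dtm <- colSums(toy_dtm)
term_totals_tdm <- rowSums(toy_tdm)
all.equal(term_totals_dtm, term_totals_tdm)
```

Which orientation to use is mostly a matter of convenience: with terms as rows, rowSums gives term frequencies directly.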

Back to top

BARCHART OF FREQUENT TERMS

In [23]:
# Calculate the rowSums: term_frequency
term_frequency <- rowSums(AP.recaps_m)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing=TRUE)

# View the top 20 most common words
term_frequency[1:20]

# Plot a barchart of the 20 most common words
barplot(term_frequency[1:20], col = "tan", las = 2)
Out[23]:
game      843
senat     720
goal      584
play      512
ottawa    502
score     497
said      493
second    417
first     402
shot      398
period    370
just      277
night     257
two       240
season    238
anderson  237
third     233
get       230
made      218
point     200
Out[23]:
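
Converting the whole TDM to a dense matrix is fine for 101 recaps, but it can become memory-hungry on larger corpora. One possible alternative, sketched here on a made-up three-document corpus (toy_docs and toy_tdm are hypothetical), is to sum the rows of the sparse matrix directly with row_sums from the slam package, which tm builds on:

```r
library(tm)

# Tiny made-up corpus, standing in for the recaps
toy_docs <- c("goal goal save", "save shot", "goal shot shot")
toy_tdm  <- TermDocumentMatrix(VCorpus(VectorSource(toy_docs)))

# Dense route, as in the cell above
dense_freq <- rowSums(as.matrix(toy_tdm))

# Sparse route: no dense matrix is created
sparse_freq <- slam::row_sums(toy_tdm)

# Both routes give the same term frequencies
sort(sparse_freq, decreasing = TRUE)
```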

Back to top

WORDCLOUD OF FREQUENT TERMS

In [24]:
# Load wordcloud package
library('wordcloud')

# Print the first 20 entries in term_frequency
term_frequency[1:20]

# Create word_freqs: a data frame with one row per term and its total count
word_freqs <- data.frame(term_frequency)
word_freqs$term <- rownames(word_freqs)
word_freqs <- word_freqs[, c(2, 1)]
colnames(word_freqs) <- c("term", "num")

# Create a wordcloud for the values in word_freqs
wordcloud(word_freqs$term, word_freqs$num, max.words=100, colors="red")
Out[24]:
game      843
senat     720
goal      584
play      512
ottawa    502
score     497
said      493
second    417
first     402
shot      398
period    370
just      277
night     257
two       240
season    238
anderson  237
third     233
get       230
made      218
point     200
Out[24]:

Back to top

REPRISE

Let's try this again, this time with a cleaning function customized for the Sens recaps (extra stopwords are removed):

In [25]:
# Customized cleaning function for the Sens recaps
clean_corpus_Sens <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  # Remove extra words: the recaps are about the Sens, and that info would dominate.
  # Caution: at this point the text is already stemmed and lowercased, so "Ottawa" and
  # "Senators" no longer match anything ("ottawa" and "senat" remain in the output below).
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "game", "first", "second", "third", "Ottawa", "Senators"))
  return(corpus)
}

# Apply your customized function to the AP.recaps.corpus: clean_corp2.AP.recaps
clean_corp2.AP.recaps <- clean_corpus_Sens(AP.recaps.corpus)

# Create a TDM from clean_corp2.AP.recaps: AP.recaps2_tdm
AP.recaps2_tdm <- TermDocumentMatrix(clean_corp2.AP.recaps)

# Convert AP.recaps2_tdm to a matrix: AP.recaps2_m
AP.recaps2_m <- as.matrix(AP.recaps2_tdm)

# Calculate the rowSums: term_frequency2
term_frequency2 <- rowSums(AP.recaps2_m)

# Sort term_frequency2 in descending order
term_frequency2 <- sort(term_frequency2, decreasing=TRUE)

# Print the first 20 entries in term_frequency2
term_frequency2[1:20]

# Plot a barchart of the 20 most common words
barplot(term_frequency2[1:20], col = "tan", las = 2)

# Create word_freqs2: a data frame with one row per term and its total count
word_freqs2 <- data.frame(term_frequency2)
word_freqs2$term <- rownames(word_freqs2)
word_freqs2 <- word_freqs2[, c(2, 1)]
colnames(word_freqs2) <- c("term", "num")

# Create a wordcloud for the values in word_freqs2
wordcloud(word_freqs2$term, word_freqs2$num, max.words=100, colors="red")
Out[25]:
senat     720
goal      584
play      512
ottawa    502
score     497
said      493
shot      398
period    370
just      277
night     257
two       240
season    238
anderson  237
get       230
made      218
point     200
save      199
team      195
lead      194
got       193
Out[25]:
Out[25]:
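
In the output of this cell, senat and ottawa are still among the most frequent terms even though "Ottawa" and "Senators" were passed to removeWords. A likely explanation is the order of the steps: the words are removed only after stemming and lowercasing, when the text contains "ottawa" and "senat" rather than the capitalized, unstemmed forms. The variant below is a sketch only, not the course's version: clean_corpus_Sens2 and toy_corpus are hypothetical names, the qdap replace_abbreviation step is left out to keep the example short, and the custom words are written in lower case and removed before stemming.

```r
library(tm)

# Hypothetical variant: lowercase first, remove words next, stem last
clean_corpus_Sens2 <- function(corpus,
                               extra_words = c("game", "first", "second", "third",
                                               "ottawa", "senators")) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Stopwords are removed while the words are still in their full, unstemmed form
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), extra_words))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up one-sentence corpus to try the function on
toy_corpus <- VCorpus(VectorSource("The Ottawa Senators won the game 2-1 on Sunday."))
res <- content(clean_corpus_Sens2(toy_corpus)[[1]])
res
```

Whether the team name should be dropped at all depends on the question being asked; as noted earlier, there is no pre-processing formula that works for every corpus.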

We can also look at how often the Senators' players and coach appear in these recaps.

In [26]:
# Senators players and coach surnames 
keep=c("anderson","borowiecki","boucher","brassard","burrows","ceci","chabot","chiasson","claesson","condon","didomenico","drieger","hammond","hoffman","jokipakka","karlsson","lazar","macarthur","mccormick","methot","moore","pageau","phaneuf","puempel","pyatt","ryan","ryans","smith","stalberg","stone","white","wideman","wingels")

# Only keep the Senators surnames
word_freqs3 = word_freqs2[word_freqs2$term %in% keep, ]
In [27]:
# Plot a barchart of the Senators players and coach
barplot(term_frequency2[word_freqs2$term %in% keep], col = "tan", las = 2)

# Create a wordcloud for the values in word_freqs3
wordcloud(word_freqs3$term, word_freqs3$num, max.words=100, colors="red")
Out[27]:
Out[27]:
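
Because clean_corpus_Sens stems the recaps, a surname only matches if its stemmed form is in the list: burrows, for instance, would be counted under burrow. One possible remedy, sketched here under the assumption that the SnowballC package (the stemmer that tm's stemDocument calls) is installed, is to stem the surnames before filtering. toy_freq is a made-up stand-in for term_frequency2.

```r
library(SnowballC)

# A few surnames from the list above, and their stemmed forms
surnames      <- c("burrows", "wingels", "karlsson")
surname_stems <- wordStem(surnames, language = "english")

# Hypothetical stemmed term frequencies, standing in for term_frequency2
toy_freq <- c(karlsson = 12, burrow = 5, wingel = 2, save = 30)

# Matching on the raw surnames misses the stemmed terms
raw_match <- toy_freq[names(toy_freq) %in% surnames]

# Matching on the stems picks them up
stem_match <- toy_freq[names(toy_freq) %in% surname_stems]
stem_match
```

Common words that double as surnames (stone, white, smith, ryan) still need a manual check, since a frequency count cannot tell the player from the ordinary word.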

Can we conclude anything about the Senators season from these graphs?


Back to top

EXERCISES

  • conduct a similar analysis using the fields SSS_Recap and OPP_Recap
  • conduct a similar analysis using the fields AP_Headline, SSS_Headline and OPP_Title